GBM Discussion

The data here is taken from the Data Hackathon3.x - http://datahack.analyticsvidhya.com/contest/data-hackathon-3x

Import Libraries:


In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
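# sklearn.cross_validation and sklearn.grid_search are the older module names;
# later scikit-learn releases provide the same tools in sklearn.model_selection.
from sklearn import cross_validation, metrics
from sklearn.grid_search import GridSearchCV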

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

Load Data:

The data has gone through the following pre-processing:

  1. City variable dropped because of too many categories
  2. DOB converted to Age | DOB dropped
  3. EMI_Loan_Submitted_Missing created which is 1 if EMI_Loan_Submitted was missing else 0 | EMI_Loan_Submitted dropped
  4. EmployerName dropped because of too many categories
  5. Existing_EMI imputed with 0 (median) - 111 values were missing
  6. Interest_Rate_Missing created which is 1 if Interest_Rate was missing else 0 | Interest_Rate dropped
  7. Lead_Creation_Date dropped because it had little intuitive impact on the outcome
  8. Loan_Amount_Applied, Loan_Tenure_Applied: missing values imputed
  9. Loan_Amount_Submitted_Missing created which is 1 if Loan_Amount_Submitted was missing else 0 | Loan_Amount_Submitted dropped
  10. Loan_Tenure_Submitted_Missing created which is 1 if Loan_Tenure_Submitted was missing else 0 | Loan_Tenure_Submitted dropped
  11. LoggedIn, Salary_Account removed
  12. Processing_Fee_Missing created which is 1 if Processing_Fee was missing else 0 | Processing_Fee dropped
  13. Source - top 2 kept as is and all others combined into a separate category
  14. Numerical encoding and one-hot encoding performed
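
As a rough illustration of the kind of steps listed above, here is a minimal pandas sketch on a few made-up rows. The column names follow the list, but the values and the category labels 'A', 'B', 'C' and 'others' are assumptions for the example, not the actual pre-processing script.

```python
import numpy as np
import pandas as pd

# Toy rows only: column names follow the list above, values are made up.
raw = pd.DataFrame({
    'Interest_Rate': [13.5, np.nan, 18.0, np.nan, 15.0],
    'Existing_EMI': [0.0, 2500.0, np.nan, 0.0, 0.0],
    'Source': ['A', 'B', 'A', 'B', 'C'],
})

# Missing-value indicator: 1 if the value was missing, else 0; then drop the original column.
raw['Interest_Rate_Missing'] = raw['Interest_Rate'].isnull().astype(int)
raw = raw.drop('Interest_Rate', axis=1)

# Median imputation (the median of the observed values here is 0).
raw['Existing_EMI'] = raw['Existing_EMI'].fillna(raw['Existing_EMI'].median())

# Keep the two most frequent sources and pool the rest into one category.
top2 = raw['Source'].value_counts().index[:2]
raw['Source'] = raw['Source'].where(raw['Source'].isin(top2), 'others')

# One-hot encode the remaining categorical column.
processed = pd.get_dummies(raw, columns=['Source'])
print(processed)
```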

In [2]:
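train = pd.read_csv('train_modified.csv')
# Note: the modelfit calls below also pass a `test` DataFrame, which is not loaded anywhere in this notebook.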

In [3]:
target='Disbursed'
IDcol = 'ID'

In [6]:
train['Disbursed'].value_counts()


Out[6]:
0.0    85747
1.0     1273
Name: Disbursed, dtype: int64

Define a function for modeling and cross-validation

This function will do the following:

  1. fit the model
  2. determine training accuracy
  3. determine training AUC
  4. perform CV if performCV is True
  5. plot feature importance if printFeatureImportance is True

The dtest argument is accepted but not used by the function as written, so no testing AUC is reported.

In [7]:
def modelfit(alg, dtrain, dtest, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    #Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'])
        
    #Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]
    
    #Perform cross-validation:
    if performCV:
        cv_score = cross_validation.cross_val_score(alg, dtrain[predictors], dtrain['Disbursed'], cv=cv_folds, scoring='roc_auc')
    
    #Print model report (print() with a single argument works in both Python 2 and 3):
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob))
    
    if performCV:
        print("CV Score : Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g" % (np.mean(cv_score),np.std(cv_score),np.min(cv_score),np.max(cv_score)))
                
    #Plot feature importance:
    if printFeatureImportance:
        feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')
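
For readers on a newer scikit-learn release, where the cross-validation helpers live in sklearn.model_selection, the same kind of report can be sketched as below. This is only an illustration: the data is synthetic (make_classification stands in for the training set) and the small n_estimators is chosen to keep it quick.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data standing in for the training set (assumption).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=10)

alg = GradientBoostingClassifier(n_estimators=20, random_state=10)
alg.fit(X, y)

# Training AUC from the predicted probability of the positive class.
train_auc = metrics.roc_auc_score(y, alg.predict_proba(X)[:, 1])

# 5-fold cross-validation scored by AUC.
cv_score = cross_val_score(alg, X, y, cv=5, scoring='roc_auc')

print("AUC Score (Train): %f" % train_auc)
print("CV Score : Mean - %.7g | Std - %.7g | Min - %.7g | Max - %.7g"
      % (np.mean(cv_score), np.std(cv_score), np.min(cv_score), np.max(cv_score)))
```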

Baseline Model

Since the evaluation criterion here is AUC, always predicting the majority class would give an AUC of 0.5. Another way of getting a baseline model is to use the algorithm without tuning, i.e. with default parameters.
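
To make the first point concrete, the sketch below scores a constant "majority class" prediction against labels built from the class counts shown earlier (85747 zeros and 1273 ones). The labels are reconstructed from those counts rather than read from the file.

```python
import numpy as np
from sklearn import metrics

# Class counts taken from the value_counts() output above.
n_neg, n_pos = 85747, 1273
y_true = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])

# A "model" that gives every row the same score for the majority class.
const_pred = np.zeros_like(y_true)

majority_accuracy = metrics.accuracy_score(y_true, const_pred)
majority_auc = metrics.roc_auc_score(y_true, const_pred)

print("Accuracy : %.4f" % majority_accuracy)
print("AUC      : %.1f" % majority_auc)
```

Accuracy is already very high for this constant prediction because of the class imbalance, which is why AUC is the more informative number to track in the reports that follow.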


In [9]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, train, test, predictors)


Model Report
Accuracy : 0.9856
AUC Score (Train): 0.862264
CV Score : Mean - 0.8319 | Std - 0.008757 | Min - 0.8208 | Max - 0.8439

GBM Models:

There are 2 types of parameters here:

  1. Tree-specific parameters
    • min_samples_split
    • min_samples_leaf
    • max_depth
    • max_leaf_nodes
    • max_features
    • loss function
  2. Boosting-specific parameters
    • n_estimators
    • learning_rate
    • subsample

Approach for tackling the problem

  1. Choose a relatively high learning rate and tune the number of estimators required for it.
  2. Tune the tree-specific parameters for that learning rate
  3. Tune subsample
  4. Lower the learning rate as far as is computationally feasible and increase the number of estimators accordingly.

Step 1- Find the number of estimators for a high learning rate

We will use the following benchmarks for parameters:

  1. min_samples_split = 500 : ~0.5-1% of total values. Since this is an imbalanced class problem, we'll take a value at the lower end
  2. min_samples_leaf = 50 : used only to prevent overfitting for now; will be tuned later
  3. max_depth = 8 : since there is a high number of observations and predictors, choose a relatively high value
  4. max_features = 'sqrt' : general rule of thumb to start with
  5. subsample = 0.8 : typically used value (will be tuned later)

0.1 is assumed to be a good learning rate to start with. Let's try to find the optimum number of estimators required for this.
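
As a small sanity check on the min_samples_split benchmark, the arithmetic below works out what 0.5-1% of the rows comes to, using the class counts from the value_counts() output as the total number of rows.

```python
# Total rows = sum of the two class counts shown earlier.
n_rows = 85747 + 1273

# 0.5% and 1% of the rows.
low, high = 0.005 * n_rows, 0.01 * n_rows

# Share of the data that the starting value of 500 corresponds to.
share = 500.0 / n_rows

print("rows: %d" % n_rows)
print("0.5%% to 1%% of rows: %.0f to %.0f" % (low, high))
print("500 rows is %.2f%% of the data" % (100 * share))
```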


In [10]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
param_test1 = {'n_estimators':range(20,81,10)}
gsearch1 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=500,
                                  min_samples_leaf=50,max_depth=8,max_features='sqrt', subsample=0.8,random_state=10), 
                       param_grid = param_test1, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch1.fit(train[predictors],train[target])


Out[10]:
GridSearchCV(cv=5, error_score='raise',
       estimator=GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=8, max_features='sqrt', max_leaf_nodes=None,
              min_samples_leaf=50, min_samples_split=500,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=10, subsample=0.8, verbose=0,
              warm_start=False),
       fit_params={}, iid=False, n_jobs=4,
       param_grid={'n_estimators': [20, 30, 40, 50, 60, 70, 80]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=0)

In [11]:
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_


Out[11]:
([mean: 0.83322, std: 0.00985, params: {'n_estimators': 20},
  mean: 0.83684, std: 0.00986, params: {'n_estimators': 30},
  mean: 0.83752, std: 0.00978, params: {'n_estimators': 40},
  mean: 0.83761, std: 0.00991, params: {'n_estimators': 50},
  mean: 0.83843, std: 0.00987, params: {'n_estimators': 60},
  mean: 0.83832, std: 0.00956, params: {'n_estimators': 70},
  mean: 0.83764, std: 0.01001, params: {'n_estimators': 80}],
 {'n_estimators': 60},
 0.83842766395593704)

So 60 is the optimal number of estimators for a learning rate of 0.1. Note that 60 is a reasonable value and can be used as it is, but that might not be the case in every situation. Other situations:

  1. If the value is around 20, you might want to try lowering the learning rate to 0.05 and re-run the grid search
  2. If the value is too high (~100), tuning the other parameters will take a long time and you can try a higher learning rate
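
The grid_scores_ attribute used above belongs to the older GridSearchCV. In newer scikit-learn releases the per-candidate results are exposed through cv_results_ instead. Here is a minimal sketch of the same kind of search on synthetic data, with a deliberately small grid and a shallow tree so it runs quickly; neither the data nor the grid values are from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the training set (assumption).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=10)

search = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1, max_depth=3,
                                         subsample=0.8, random_state=10),
    param_grid={'n_estimators': [10, 20, 30]},
    scoring='roc_auc', cv=5)
search.fit(X, y)

# One row per candidate: mean and std of the CV AUC.
results = pd.DataFrame(search.cv_results_)[
    ['param_n_estimators', 'mean_test_score', 'std_test_score']]
print(results)
print(search.best_params_)
print(search.best_score_)
```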

Step 2- Tune tree-specific parameters

Now, let's move on to tuning the tree parameters. We will do this in 3 stages:

  1. Tune max_depth and min_samples_split
  2. Tune min_samples_leaf
  3. Tune max_features

In [13]:
#Grid search on max_depth and min_samples_split
param_test2 = {'max_depth':range(5,16,2), 'min_samples_split':range(200,1001,200)}
gsearch2 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,
                                                max_features='sqrt', subsample=0.8, random_state=10), 
                       param_grid = param_test2, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch2.fit(train[predictors],train[target])


Out[13]:
GridSearchCV(cv=5, error_score='raise',
       estimator=GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features='sqrt', max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=60,
              presort='auto', random_state=10, subsample=0.8, verbose=0,
              warm_start=False),
       fit_params={}, iid=False, n_jobs=4,
       param_grid={'min_samples_split': [200, 400, 600, 800, 1000], 'max_depth': [5, 7, 9, 11, 13, 15]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=0)

In [14]:
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_


Out[14]:
([mean: 0.83256, std: 0.01272, params: {'min_samples_split': 200, 'max_depth': 5},
  mean: 0.83285, std: 0.01016, params: {'min_samples_split': 400, 'max_depth': 5},
  mean: 0.83386, std: 0.01415, params: {'min_samples_split': 600, 'max_depth': 5},
  mean: 0.83379, std: 0.01169, params: {'min_samples_split': 800, 'max_depth': 5},
  mean: 0.83338, std: 0.01268, params: {'min_samples_split': 1000, 'max_depth': 5},
  mean: 0.83390, std: 0.00759, params: {'min_samples_split': 200, 'max_depth': 7},
  mean: 0.83660, std: 0.00994, params: {'min_samples_split': 400, 'max_depth': 7},
  mean: 0.83481, std: 0.00827, params: {'min_samples_split': 600, 'max_depth': 7},
  mean: 0.83788, std: 0.01066, params: {'min_samples_split': 800, 'max_depth': 7},
  mean: 0.83769, std: 0.01060, params: {'min_samples_split': 1000, 'max_depth': 7},
  mean: 0.83631, std: 0.00942, params: {'min_samples_split': 200, 'max_depth': 9},
  mean: 0.83695, std: 0.00923, params: {'min_samples_split': 400, 'max_depth': 9},
  mean: 0.83339, std: 0.00893, params: {'min_samples_split': 600, 'max_depth': 9},
  mean: 0.83793, std: 0.00965, params: {'min_samples_split': 800, 'max_depth': 9},
  mean: 0.83844, std: 0.00954, params: {'min_samples_split': 1000, 'max_depth': 9},
  mean: 0.83036, std: 0.00998, params: {'min_samples_split': 200, 'max_depth': 11},
  mean: 0.83077, std: 0.00809, params: {'min_samples_split': 400, 'max_depth': 11},
  mean: 0.83366, std: 0.00983, params: {'min_samples_split': 600, 'max_depth': 11},
  mean: 0.83193, std: 0.00911, params: {'min_samples_split': 800, 'max_depth': 11},
  mean: 0.83582, std: 0.01040, params: {'min_samples_split': 1000, 'max_depth': 11},
  mean: 0.82198, std: 0.01037, params: {'min_samples_split': 200, 'max_depth': 13},
  mean: 0.83055, std: 0.00837, params: {'min_samples_split': 400, 'max_depth': 13},
  mean: 0.83139, std: 0.01127, params: {'min_samples_split': 600, 'max_depth': 13},
  mean: 0.83403, std: 0.01060, params: {'min_samples_split': 800, 'max_depth': 13},
  mean: 0.83288, std: 0.00974, params: {'min_samples_split': 1000, 'max_depth': 13},
  mean: 0.82009, std: 0.00691, params: {'min_samples_split': 200, 'max_depth': 15},
  mean: 0.82317, std: 0.01017, params: {'min_samples_split': 400, 'max_depth': 15},
  mean: 0.82909, std: 0.00904, params: {'min_samples_split': 600, 'max_depth': 15},
  mean: 0.82926, std: 0.00944, params: {'min_samples_split': 800, 'max_depth': 15},
  mean: 0.83236, std: 0.01421, params: {'min_samples_split': 1000, 'max_depth': 15}],
 {'max_depth': 9, 'min_samples_split': 1000},
 0.83843938077464664)

Since the best min_samples_split is at the upper edge of the grid (1000), we should check higher values as well. We can also tune min_samples_leaf along with it now that max_depth is fixed at 9. One might argue that the best max_depth could change for higher values of min_samples_split, but a close look at the output shows that a max_depth of 9 gave the best score for four of the five min_samples_split values tried. So let's perform a grid search on them:
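
The observation about max_depth can be checked directly from the grid output. The sketch below copies the mean CV AUC values from Out[14] into a small table (rows are max_depth, columns are min_samples_split) and picks the best depth in each column.

```python
import pandas as pd

# Mean CV AUC copied from the grid-search output above:
# rows are max_depth, columns are min_samples_split.
scores = pd.DataFrame(
    [[0.83256, 0.83285, 0.83386, 0.83379, 0.83338],
     [0.83390, 0.83660, 0.83481, 0.83788, 0.83769],
     [0.83631, 0.83695, 0.83339, 0.83793, 0.83844],
     [0.83036, 0.83077, 0.83366, 0.83193, 0.83582],
     [0.82198, 0.83055, 0.83139, 0.83403, 0.83288],
     [0.82009, 0.82317, 0.82909, 0.82926, 0.83236]],
    index=[5, 7, 9, 11, 13, 15],
    columns=[200, 400, 600, 800, 1000])

# Best max_depth for each min_samples_split value.
best_depth = scores.idxmax(axis=0)
print(best_depth)
```

A depth of 9 comes out on top in every column except min_samples_split = 600, where a depth of 7 scores slightly higher.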


In [17]:
#Grid search on min_samples_split and min_samples_leaf
param_test3 = {'min_samples_split':range(1000,2100,200), 'min_samples_leaf':range(30,71,10)}
gsearch3 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=9,
                                                    max_features='sqrt', subsample=0.8, random_state=10), 
                       param_grid = param_test3, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch3.fit(train[predictors],train[target])


Out[17]:
GridSearchCV(cv=5, error_score='raise',
       estimator=GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=9, max_features='sqrt', max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=60,
              presort='auto', random_state=10, subsample=0.8, verbose=0,
              warm_start=False),
       fit_params={}, iid=False, n_jobs=4,
       param_grid={'min_samples_split': [1000, 1200, 1400, 1600, 1800, 2000], 'min_samples_leaf': [30, 40, 50, 60, 70]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=0)

In [18]:
gsearch3.grid_scores_, gsearch3.best_params_, gsearch3.best_score_


Out[18]:
([mean: 0.83821, std: 0.01092, params: {'min_samples_split': 1000, 'min_samples_leaf': 30},
  mean: 0.83889, std: 0.01271, params: {'min_samples_split': 1200, 'min_samples_leaf': 30},
  mean: 0.83552, std: 0.01024, params: {'min_samples_split': 1400, 'min_samples_leaf': 30},
  mean: 0.83683, std: 0.01429, params: {'min_samples_split': 1600, 'min_samples_leaf': 30},
  mean: 0.83958, std: 0.01233, params: {'min_samples_split': 1800, 'min_samples_leaf': 30},
  mean: 0.83783, std: 0.01137, params: {'min_samples_split': 2000, 'min_samples_leaf': 30},
  mean: 0.83821, std: 0.00872, params: {'min_samples_split': 1000, 'min_samples_leaf': 40},
  mean: 0.83740, std: 0.01280, params: {'min_samples_split': 1200, 'min_samples_leaf': 40},
  mean: 0.83714, std: 0.01019, params: {'min_samples_split': 1400, 'min_samples_leaf': 40},
  mean: 0.83771, std: 0.01188, params: {'min_samples_split': 1600, 'min_samples_leaf': 40},
  mean: 0.83738, std: 0.01370, params: {'min_samples_split': 1800, 'min_samples_leaf': 40},
  mean: 0.83765, std: 0.01221, params: {'min_samples_split': 2000, 'min_samples_leaf': 40},
  mean: 0.83575, std: 0.01017, params: {'min_samples_split': 1000, 'min_samples_leaf': 50},
  mean: 0.83744, std: 0.01224, params: {'min_samples_split': 1200, 'min_samples_leaf': 50},
  mean: 0.83892, std: 0.01234, params: {'min_samples_split': 1400, 'min_samples_leaf': 50},
  mean: 0.83814, std: 0.01354, params: {'min_samples_split': 1600, 'min_samples_leaf': 50},
  mean: 0.83824, std: 0.01116, params: {'min_samples_split': 1800, 'min_samples_leaf': 50},
  mean: 0.83821, std: 0.01014, params: {'min_samples_split': 2000, 'min_samples_leaf': 50},
  mean: 0.83626, std: 0.01111, params: {'min_samples_split': 1000, 'min_samples_leaf': 60},
  mean: 0.83959, std: 0.00989, params: {'min_samples_split': 1200, 'min_samples_leaf': 60},
  mean: 0.83735, std: 0.01217, params: {'min_samples_split': 1400, 'min_samples_leaf': 60},
  mean: 0.83685, std: 0.01325, params: {'min_samples_split': 1600, 'min_samples_leaf': 60},
  mean: 0.83589, std: 0.01101, params: {'min_samples_split': 1800, 'min_samples_leaf': 60},
  mean: 0.83769, std: 0.01173, params: {'min_samples_split': 2000, 'min_samples_leaf': 60},
  mean: 0.83792, std: 0.00994, params: {'min_samples_split': 1000, 'min_samples_leaf': 70},
  mean: 0.83712, std: 0.01053, params: {'min_samples_split': 1200, 'min_samples_leaf': 70},
  mean: 0.83777, std: 0.01186, params: {'min_samples_split': 1400, 'min_samples_leaf': 70},
  mean: 0.83812, std: 0.01126, params: {'min_samples_split': 1600, 'min_samples_leaf': 70},
  mean: 0.83812, std: 0.01055, params: {'min_samples_split': 1800, 'min_samples_leaf': 70},
  mean: 0.83677, std: 0.01190, params: {'min_samples_split': 2000, 'min_samples_leaf': 70}],
 {'min_samples_leaf': 60, 'min_samples_split': 1200},
 0.83959087132827259)

In [49]:
modelfit(gsearch3.best_estimator_, train, test, predictors)


Model Report
Accuracy : 0.9854
AUC Score (Train): 0.896453
CV Score : Mean - 0.8395909 | Std - 0.009890497 | Min - 0.8259075 | Max - 0.8527672

Tune max_features:


In [20]:
#Grid search on max_features
param_test4 = {'max_features':range(7,20,2)}
gsearch4 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=9, 
                            min_samples_split=1200, min_samples_leaf=60, subsample=0.8, random_state=10),
                       param_grid = param_test4, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch4.fit(train[predictors],train[target])


Out[20]:
GridSearchCV(cv=5, error_score='raise',
       estimator=GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=9, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=60, min_samples_split=1200,
              min_weight_fraction_leaf=0.0, n_estimators=60,
              presort='auto', random_state=10, subsample=0.8, verbose=0,
              warm_start=False),
       fit_params={}, iid=False, n_jobs=4,
       param_grid={'max_features': [7, 9, 11, 13, 15, 17, 19]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=0)

In [22]:
gsearch4.grid_scores_, gsearch4.best_params_, gsearch4.best_score_


Out[22]:
([mean: 0.83959, std: 0.00989, params: {'max_features': 7},
  mean: 0.83648, std: 0.00988, params: {'max_features': 9},
  mean: 0.83919, std: 0.01042, params: {'max_features': 11},
  mean: 0.83738, std: 0.01017, params: {'max_features': 13},
  mean: 0.83820, std: 0.01017, params: {'max_features': 15},
  mean: 0.83495, std: 0.00957, params: {'max_features': 17},
  mean: 0.83499, std: 0.00996, params: {'max_features': 19}],
 {'max_features': 7},
 0.83959087132827259)
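
Note that the score for max_features = 7 matches the best score of the previous search digit for digit, which is what we would expect if max_features='sqrt' had already resolved to 7 features there. Assuming scikit-learn takes the integer part of the square root of the number of predictors (an assumption about the library's rule, not something shown in this notebook), the sketch below lists the predictor counts that would be consistent with that.

```python
import math

# Assumed rule: 'sqrt' resolves to max(1, int(sqrt(n_features))).
def resolved_sqrt_features(n_features):
    return max(1, int(math.sqrt(n_features)))

# Predictor counts for which 'sqrt' would resolve to 7 features.
candidates = [n for n in range(1, 200) if resolved_sqrt_features(n) == 7]
print("%d to %d predictors" % (candidates[0], candidates[-1]))
```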

Step 3- Tune Subsample and Lower Learning Rate


In [23]:
#Grid search on subsample
param_test5 = {'subsample':[0.6,0.7,0.75,0.8,0.85,0.9]}
gsearch5 = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,max_depth=9, 
                            min_samples_split=1200, min_samples_leaf=60, subsample=0.8, random_state=10, max_features=7),
                       param_grid = param_test5, scoring='roc_auc',n_jobs=4,iid=False, cv=5)
gsearch5.fit(train[predictors],train[target])


Out[23]:
GridSearchCV(cv=5, error_score='raise',
       estimator=GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=9, max_features=7, max_leaf_nodes=None,
              min_samples_leaf=60, min_samples_split=1200,
              min_weight_fraction_leaf=0.0, n_estimators=60,
              presort='auto', random_state=10, subsample=0.8, verbose=0,
              warm_start=False),
       fit_params={}, iid=False, n_jobs=4,
       param_grid={'subsample': [0.6, 0.7, 0.75, 0.8, 0.85, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, scoring='roc_auc', verbose=0)

In [24]:
gsearch5.grid_scores_, gsearch5.best_params_, gsearch5.best_score_


Out[24]:
([mean: 0.83621, std: 0.00950, params: {'subsample': 0.6},
  mean: 0.83648, std: 0.01181, params: {'subsample': 0.7},
  mean: 0.83601, std: 0.01074, params: {'subsample': 0.75},
  mean: 0.83959, std: 0.00989, params: {'subsample': 0.8},
  mean: 0.83989, std: 0.01078, params: {'subsample': 0.85},
  mean: 0.83827, std: 0.01076, params: {'subsample': 0.9}],
 {'subsample': 0.85},
 0.83988852960292915)

With all the other parameters tuned, let's try reducing the learning rate and proportionally increasing the number of estimators to get more robust results:
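
"Proportionally" here means keeping learning_rate × n_estimators roughly constant, starting from the tuned pair of 0.1 and 60. The helper below is a hypothetical convenience function (not part of scikit-learn) that works out the number of estimators for a lower learning rate.

```python
def scaled_estimators(base_rate, base_estimators, new_rate):
    """Scale n_estimators so that learning_rate * n_estimators stays constant."""
    return int(round(base_estimators * base_rate / new_rate))

# Starting point from the tuning above: learning_rate=0.1, n_estimators=60.
for rate in [0.05, 0.01, 0.005]:
    print("learning_rate=%s -> n_estimators=%d" % (rate, scaled_estimators(0.1, 60, rate)))
```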


In [26]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm_tuned_1 = GradientBoostingClassifier(learning_rate=0.05, n_estimators=120,max_depth=9, min_samples_split=1200, 
                                         min_samples_leaf=60, subsample=0.85, random_state=10, max_features=7)
modelfit(gbm_tuned_1, train, test, predictors)


Model Report
Accuracy : 0.9854
AUC Score (Train): 0.897471
CV Score : Mean - 0.8396 | Std - 0.009514 | Min - 0.8266 | Max - 0.8516

1/10th learning rate


In [29]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm_tuned_2 = GradientBoostingClassifier(learning_rate=0.01, n_estimators=600,max_depth=9, min_samples_split=1200, 
                                         min_samples_leaf=60, subsample=0.85, random_state=10, max_features=7)
modelfit(gbm_tuned_2, train, test, predictors)


Model Report
Accuracy : 0.9854
AUC Score (Train): 0.899927
CV Score : Mean - 0.8409339 | Std - 0.01035658 | Min - 0.8258238 | Max - 0.8529458

1/20th learning rate


In [43]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm_tuned_3 = GradientBoostingClassifier(learning_rate=0.005, n_estimators=1200,max_depth=9, min_samples_split=1200, 
                                         min_samples_leaf=60, subsample=0.85, random_state=10, max_features=7,
                                         warm_start=True)
modelfit(gbm_tuned_3, train, test, predictors, performCV=False)


Model Report
Accuracy : 0.9854
AUC Score (Train): 0.900688

In [46]:
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
gbm_tuned_4 = GradientBoostingClassifier(learning_rate=0.005, n_estimators=1500,max_depth=9, min_samples_split=1200, 
                                         min_samples_leaf=60, subsample=0.85, random_state=10, max_features=7,
                                         warm_start=True)
modelfit(gbm_tuned_4, train, test, predictors, performCV=False)


Model Report
Accuracy : 0.9854
AUC Score (Train): 0.906346
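
The last two models were fitted with performCV=False, so only their training AUC is available. To put the training numbers in context, the sketch below takes the training AUC and mean CV score from the model reports above that include both, and prints the gap between them. It only rearranges numbers already shown; it does not evaluate anything new.

```python
# (model, training AUC, mean CV AUC) copied from the model reports above.
reports = [
    ('gbm0 (baseline)', 0.862264, 0.8319),
    ('gsearch3 best estimator', 0.896453, 0.8395909),
    ('gbm_tuned_1', 0.897471, 0.8396),
    ('gbm_tuned_2', 0.899927, 0.8409339),
]

# Gap between training AUC and cross-validated AUC for each model.
gaps = [(name, train_auc - cv_mean) for name, train_auc, cv_mean in reports]

for name, gap in gaps:
    print("%-24s train - CV gap: %.4f" % (name, gap))
```

The training AUC rises faster than the CV score across these models, so the higher training AUC of the last two models may not by itself indicate a better model; a CV run would be needed to tell.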